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■ Abstract 

We propose a new gradient projection algorithm that compares favorably with 
the fastest algorithms available to date for ^i-constrained sparse recovery from noisy 
data, both in the compressed sensing and inverse problem frameworks. The method 
exploits a line-search along the feasible direction and an adaptive steplength selection 
based on recent strategies for the alternation of the well-known Barzilai-Borwein rules. 
The convergence of the proposed approach is discussed and a computational study 
on both well-conditioned and ill-conditioned problems is carried out for performance 
i evaluations in comparison with five other algorithms proposed in the literature. 

■ 1 Introduction 

There has been a vast amount of recent literature dedicated to algorithms for sparse 
recovery, both in the context of inverse imaging problems and of compressed sensing. As 
an alternative to the usual quadratic penalties used in regularization theory for ill-posed or 
ill-conditioned inverse problems, the use of ^i-type penalties has been advocated in order 
to recover regularized solutions having sparse expansions on a given basis or frame, such 
as e.g. a wavelet system [14]. Denoting by x G W the vector of coefficients describing the 
unknown object, by y £ M. n the vector of (noisy) data and by K the linear operator (nxp 
matrix) modelling the link between the two, the inverse problem amounts to finding a 
regularized solution of the equation Kx = y. When it is known a priori that a: is a sparse 
vector, one can resort to the following penalized least-squares strategy [9], also referred to 
as the lasso after Tibshirani [29]: 

x(X) = argnun \\Kx — y\\ 2 + 2A[|sc||i (1) 

where A is a positive regularization parameter regulating the balance between the penalty 
and the data misfit terms. The norm || • || denotes the usual £2 norm whereas [|ac||i = 
zCf=i \ x i\ ^ S the norm of the vector x. 

In compressed sensing (also called compressive sampling), the aim is to reconstruct 
a sparse signal or object from a small number of linear measurements [6, 7, 16]. The 
recovery of such an object can then be achieved by searching for the sparsest solution 
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to the linear system Kx = y representing the measurement process, or equivalently by 
looking for a solution with minimum "£o-norm". To avoid the combinatorial complexity 
of the latter problem, one can use as a proxy a convex ^-norm minimization strategy. 
When the data y are affected by measurement errors, the problem is reformulated as a 
penalized least-squares optimization analogous to (1). 

Let us observe that problem (1) is equivalent to the constrained minimization problem: 

x(p) = arg min \\Kx — y\\ 2 (2) 

INk<p 

for a certain p. One can show that x(X) and x(p) are piecewise linear functions of A and p. 
One always has that x(X) = for A > A max = maxj \ (K T y))i\. The relationship between 
A and p is given by A = maxj \(K T (y — Kx(p)))i\ and p = ||x(A)||i [15]. 

2 Iterative minimization algorithms 

Several iterative methods for solving the minimization problems (1) or (2) have been 
proposed in the literature. For the purpose of comparison with our new acceleration 
scheme, we will focus on the following algorithms: 

1. The Iterative Soft-Thresholding Algorithm ("ISTA") proposed in [14, 19, 8]) goes as 
follows: ;r( fc+1 ) = S\[x^ +j-( fc )] where = K T (y — Kx^) is the residual in step 
k and the (nonlinear) soft-thresholding operator acts componentwise as (S^fsc]^ = 
Xi — A sgn(xj) if \xi\ > A and zero otherwise. For any initial vector x^ and under the 
condition \\K\\ < 1, this scheme has been shown to converge to the minimizer x(\) 
defined by (1) [14]. When reinterpreted as a forward-backward proximal scheme, 
convergence can be seen to hold also for \\K\\ < y/2 [10]. 

2. The Fast Iterative Soft-Thresholding Algorithm ("FISTA"), proposed in [2], is a 
variation of ISTA. Defining the operator T by T{x) = S\[x + K T (y — Kx)], the 
FISTA algorithm is: 

=T^(*) + ^i( a? W- aJ (*- 1 ))j > (3) 

where a;( ) = 0, = 1 +v / i+4(* (fc) ) 2 and t (o) = L It has virtua i ly the same 

complexity as the ISTA algorithm, but can be shown to have better convergence 
properties. 

3. The GPSR algorithm proposed in [20]. 

4. The SPARSA algorithm proposed in [30]. 

5. The Projected Steepest Descent ("PSD") method proposed in [15]: x^ k+ ^ = P^[x^ + 
/?(*M fc )], with /3( fc ) = ||r( fc )|| 2 /||Kr( fe )|| 2 . P n denotes the projection onto the €i-ball 
$7 of radius p. 

The Figures in Section 4 provide a visual way to compute the performance of these algo- 
rithms in two problem examples. Note that these are the same as in [25], where the reader 
can find comparisons to yet other methods, including e.g. the £±-ls method, an interior 
point algorithm proposed in [24]. 
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3 Gradient Projection with Adaptive Steplength Selection 



In this section we describe the acceleration scheme we propose for solving the optimization 
problem (2). This problem is a particular case of the general problem of minimizing a 
convex and continuously differentiable function f(x) over a closed convex set fi C MP. 
Here = {x G W, \\x\\i < p}. A gradient projection method for solving this problem 
can be stated as in Algorithm GP. 

Some comments about the main steps of Algorithm GP are in order. 
First of all, it is worth to stress that any choice of the steplength in a closed interval is 
permitted. This is very important from a practical point of view since it allows to make 
the updating rule of a.k problem-related and oriented at optimizing the performance. 
If the projection performed in step 2 returns a vector equal to x^ k \ then x^ is a 
stationary point and the algorithm stops. When / x^ k \ it is possible to prove that 
is a descent direction for / in x^ and the backtracking loop in step 5 terminates 
with a finite number of runs; thus the algorithm is well defined [3, 4, 5]. 
The nonmonotone line-search strategy implemented in step 5 ensures that f{x^ k+l ^) is 
lower than the maximum of the objective function in the last M iterations [23]; of course, 
if M = 1 then the strategy reduces to the standard monotone Armijo rule [3]. 

Concerning the convergence properties of the algorithm, the following result can be 
derived from the analysis carried out in [4, 5] for more general gradient projection schemes: 
if the level set VLq = {x 6 f2 : f(x) < f(x^)} is bounded, then every accumulation point 
of the sequence {x^} generated by the Algorithm GP is a stationary point of f(x) in U. 
We observe that the assumption is trivially satisfied for problem (2) since in this case the 
feasible region is bounded. 

Now, we may discuss the choice of the steplengths G [a m in, "max]- Steplength 
selection rules in gradient methods have received an increasing interest in the last years 
from both the theoretical and the practical point of view. On one hand, following the 
original ideas of Barzilai and Borwein (BB) [1], several steplength updating strategies 
have been devised to accelerate the slow convergence exhibited in most cases by standard 
gradient methods, and a lot of effort has been put into explaining the effects of these 
strategies [11, 12, 13, 18, 21, 22, 32]. On the other hand, numerical experiments on 
randomly generated, library and real-life test problems have confirmed the remarkable 
convergence rate improvements involved by some BB-like steplength selections [12, 13, 
20, 21, 28, 31, 32]. Thus, it seems natural to equip a gradient projection method with a 
steplength selection that takes into account the recent advances on the BB-like updating 
rules. 

First of all we must recall the two BB rules usually exploited by the main steplength 
updating strategies. To this end, by denoting with / the p x p identity matrix, we can 
regard the matrix B(a.k) = (afcl)^ 1 as an approximation of the Hessian V 2 f{x^) and 
derive two updating rules for by forcing quasi-Newton properties on B(a^): 

af 31 = argmin ||B( afc ) a (*-i)_ z (*-i)|| an d a jP 2 = argmin \\s^ k ~^ - B(a k )~ 1 z^ k ~ 1 ^\\, 

(4) 

where s^" 1 ) = x^ - x^ k ~^ and z^ k ~^ = Vf(x^) - Vf(x (k -^). In this way, the 
steplengths 

_ BBl _ s s _ BB2 _ s z /r\ 

(Xu — m , OLu — ^ , 10) 

S (k-1) T z (k-l) Z (k-1) T z (k-l) 

are obtained. 
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Algorithm GP (Gradient Projection Method) 



Choose the starting point a;^ G £1, set the parameters 0, 6 € (0, 1), < a m i„ < a m ax and 
fix a positive integer M. 

FOR k = 0, 1, 2, ... DO THE FOLLOWING STEPS: 

Step 1. Choose the parameter ol\. £ [a m in; a max ]; 

Step 2. Projection: /j,( fe ) = P n (x^ - a k Vf(x^)); 

If = then stop, declaring that x^ is a stationary point; 

Step 3. Descent direction: = h {k) - x^; 

Step 4. Set A fc = 1 and / max = max < i < min(feiM _ 1) f {x^' 3 ")); 

Step 5. Backtracking loop: 

If + \ k d^) < / max + (3\ k Vf(x^) T d^ then 

go to Step 6; 

Else 

set Afc = 9\ k and go to Step 5; 
Endif 

Step 6. Set x^+V = x^ + \ k d {k) . 
End 



At this point, inspired by the steplength alternations successfully implemented in recent 
gradient methods [21, 32], we propose a steplength updating rule for GP which adaptively 
alternates the values provided by (5). The details of the GP steplength selection are given 
in Algorithm SS. This rule decides the alternation between two different selection strate- 
gies by means of the variable threshold r k instead of a constant parameter as done in [21] 
and [32]. This trick makes the choice of to less important for the GP performance and, in 
our experience, seems able to avoid the drawbacks due to the use of the same steplength 
rule in too many consecutive iterations. In the following we denote by GPSS the algorithm 
GP equipped with the steplength selection SS. 

We end this section by describing the setting for the GPSS parameters used in the com- 
putational study of this work: 

• line-search parameters: M = 1 (monotone line-search), 9 = 0.5, [3 = 10~ 4 ; 

• steplength parameters: a m i n = 10~ 10 , a max = 10 10 , 

a = max{a min ,min{||P n (a;( ) - Vf(x^))\\^\ a max }}, n = 0.5, M a = 2. 

In our experience the above setting often provides satisfactory performance; however, it 
can not be considered optimal for every application and a careful parameter tuning is 
always advisable. 

4 Numerical experiments 

To assess the performances of our GPSS algorithm and estimate the gain in speed it 
can provide with respect to the algorithms 1 to 5, we perform some numerical tests. 
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Algorithm SS (Steplength Selection for GP) 
if k = THEN 

set Qo £ [amimttmax], T\ G (0, 1) and a non-negative integer M Q ; 
ELSE 

IF s( fc_1 ) T ;s( fc_1 ) < THEN 

O^k — ^maxi 
ELSE 

(1) r . r s (fe-i) T S (fe-i) 1 1 

a k = maX ( Q min, mm I 8(fc _ 1) r ;g(fc _ 1) , "max j ji 

(2) r ■ r S ( fe - i ) T 5;( fc - 1 ) 1 1 

=max|« mm ,mm|_ r ^,a max |); 

IF cJjP /oJjp < T k THEN 

afc = min |aj 2 \ j = max{l, k — M a } , . . . , 
Tfc+i = r fc * 0.9; 

ELSE 

(i) 

a k =a k ; 

ENDIF 

ENDIF 
ENDIF 



To this purpose we adopt the methodology proposed in [25] and based on the notion of 
approximation isochrones. It improves on the comparisons made for a single value of A or 
p, i.e. for a single level of sparsity of the recovered object. 

For values of A in a given interval A m i n < A < A max , one computes the minimizer 
x(X) of (1). When the number of nonzero components in x(X) is not too large, this can 
be done by means of the direct (non-iterative) homotopy method [27] or LARS algorithm 
[17]. Then, for a fixed and given computation time, one runs one of the algorithms for each 
value of A (or p). The relative error = ||x^(A) — x(A)||/||x(A)|| reached at the end of 
the computation is plotted as a function of A and hence this plot is just the approximation 
isochrone showing the degree of accuracy reached in the given amount of computing time 
for each value of A. A set of such plots allow to quickly grasp the performances of a given 
algorithm in various parameter regimes and to easily compare it with other methods; 
it reveals in one glance under which circumstances the algorithms do well or fail. The 
paper [25] also demonstrates the fact that the relative performances of the algorithms 
may strongly depend on the specific application one considers, and in particular on the 
properties of the linear operator K modelling the problem. 

We test the different algorithms on two different operators arising typically either from 
a compressed sensing or from an inverse problem. In both cases the matrix K is of size 
1848x8192. In the first case, the elements of K are taken from a Gaussian distribution 
with zero mean and variance such that \\K\\ = 1. This matrix is rather well conditioned 
and can serve as a paradigm of compressed sensing applications. It is applied to a sparse 
vector and perturbed by additive gaussian noise (about 2%) to yield the data y. The 
second matrix models a severely ill-conditioned linear inverse problem that finds its origin 
in a problem of seismic tomography described in detail in [26]. 

For both operators, the minimizer x(X) is computed for 50 different values of A (or 
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Figure 1: Approximation isochrones in the case of the gaussian random matrix for t = 
6, 12, ... , 60 seconds. 
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Figure 2: The same as Figure 1 but in a semi- log plot. 



equivalently, 50 different values of p). Then, for each iterative algorithm, we make plots 
having the relative error on the vertical axis and log 2 A max /A on the bottom horizontal 
axis (on the top horizontal axis the value of p = \\x\\i is also reported). The number of 
nonzero components m in x(X) is indicated by vertical dashed lines. In each plot we report 
the isochrone lines that correspond to a given amount of computer time. In this way one 
can see how close, for the different values of A, the iterates approach the minimizer after 
a given time. Let us remark that although the reported computing times are of course 
specific to a given computer and implementation, the overall behavior of the isochrones 
should be fairly general. For example, the fact that they get very close to each other in 
some places can be interpreted as a bottleneck feature of the algorithm. 

In Figure 1, we report the results for the ISTA, FISTA, GPSR, SPARS A, PSD and 
our new algorithm GPSS for the case of the gaussian random matrix. The proposed 
GPSS algorithm compares favorably with the other five, especially for small values of 
A. Experiments made by varying the parameter M showing no significant difference, we 
report here only the results obtained with M = 1 (monotonic line search). However, the 
behavior for large penalties is not clearly visible on Figure 1. It is better demonstrated 
when using a logarithmic scale for the relative error on the vertical axes as reported in 
Figure 2. 
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Figure 3: Approximation isochrones for the seismic inverse problem for t = 1, . . . , 10 
minutes. 



In Figures 3 and 4, we report the results for the case of the ill-conditioned matrix 
arising from the seismic inverse problem. Clearly, for this operator, ISTA, GPSR and 
PSD have a lot of difficulty in approaching the minimizer for small values of A (lines 
not approaching e = 0). The FISTA algorithm appears to work best for small penalty 
parameters whereas GPSS and SPARSA compete for the second place in such instance. 
From Figure 4, we see that the GPSS and SPARSA algorithms are performing best for 
large values of A. 

The reported encouraging numerical results call of course for further experiments, but 
we believe that they are sufficiently representative to allow honest extrapolation to reliable 
conclusions holding more generally. As seen, the proposed GPSS algorithm performs well 
for the compressed sensing problem: for small values of A, it clearly outperforms the other 
algorithms (see Figure 1) whereas it is still competitive for larger values of A. In the 
ill-conditioned inversion problem, GPSS an SPARSA appear to perform better than all 
other tested algorithms for large values of A, whereas they are challenged by the FISTA 
method for smaller values. 
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Figure 4: The same as in Figure 3 in a semi-log plot. 
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